What are we trying to study?
The Enzyme Commission (EC) system is a widely accepted classification system used to categorize enzymes based on their catalytic activities. Enzymes play crucial roles in biological processes by accelerating chemical reactions and facilitating various metabolic pathways within living organisms. The EC system assigns a unique EC number to each enzyme, which provides valuable information about its function and specificity.
EC1 represents the class of enzymes known as oxidoreductases. These enzymes catalyze oxidation-reduction reactions, which involve the transfer of electrons between molecules. Oxidoreductases are involved in a wide range of biological processes, including energy production, biosynthesis, and detoxification. Examples of oxidoreductases include dehydrogenases, oxidases, reductases, and peroxidases. (Source: the internet)
First, we import the libraries required for the analysis.
library(tidyverse)
library(naniar)
library(bookdown)
library(stringr)
library(stringi)
library(lubridate)
library(DT)
library(forcats)
library(ggthemes)
library(corrplot)
library(mltools)
library(data.table)
library(visdat)
library(janitor)
library(cowplot)
library(caTools)
library(pscl)
library(ROCR)
library(caret)
library(xgboost)
library(randomForest)
library(lightgbm)
library(Matrix)
library(catboost)
library(magrittr)
library(fmsb)
Great! With all the libraries loaded, we can now load the datasets required for the enzyme classification analysis.
We will use the train dataset for exploratory data analysis and model training, and the test dataset for evaluating the classification model on completely unseen data.
After reading the data, let us see what the train dataset looks like.
df_train <- read_csv("data/train.csv")
df_test <- read_csv("data/test.csv")
head(df_train)
## # A tibble: 6 × 38
## id BertzCT Chi1 Chi1n Chi1v Chi2n Chi2v Chi3v Chi4n EState_VSA1
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 323. 9.88 5.88 5.88 4.30 4.30 2.75 1.75 0
## 2 1 274. 7.26 4.44 5.83 3.29 4.49 2.20 1.29 45.1
## 3 2 522. 10.9 8.53 11.1 6.67 9.52 5.82 1.77 15.6
## 4 3 567. 12.5 7.09 12.8 6.48 11.0 7.91 3.07 95.6
## 5 4 113. 4.41 2.87 2.87 1.88 1.88 1.04 0.728 18.0
## 6 5 145. 5.54 3.48 3.48 2.48 2.48 1.51 0.671 36.9
## # ℹ 28 more variables: EState_VSA2 <dbl>, ExactMolWt <dbl>,
## # FpDensityMorgan1 <dbl>, FpDensityMorgan2 <dbl>, FpDensityMorgan3 <dbl>,
## # HallKierAlpha <dbl>, HeavyAtomMolWt <dbl>, Kappa3 <dbl>,
## # MaxAbsEStateIndex <dbl>, MinEStateIndex <dbl>, NumHeteroatoms <dbl>,
## # PEOE_VSA10 <dbl>, PEOE_VSA14 <dbl>, PEOE_VSA6 <dbl>, PEOE_VSA7 <dbl>,
## # PEOE_VSA8 <dbl>, SMR_VSA10 <dbl>, SMR_VSA5 <dbl>, SlogP_VSA3 <dbl>,
## # VSA_EState9 <dbl>, fr_COO <dbl>, fr_COO2 <dbl>, EC1 <dbl>, EC2 <dbl>, …
We can observe that there are multiple molecular descriptors present in the dataset which can help us analyse and predict the values of EC1 and EC2. But what do all these variables tell us? Based on the information from Panmie’s Kaggle notebook, the following are the explanations of each of these variables.
Id: This feature represents the identifier or unique identification number of a molecule. It serves as a reference but doesn’t directly contribute to the predictive model.
BertzCT: This feature corresponds to the Bertz complexity index, which measures the structural complexity of a molecule. It can provide insights into the intricacy of molecular structures.
Chi1 : The Chi1 feature denotes the 1st order molecular connectivity index, which describes the topological connectivity of atoms in a molecule. It characterizes the atomic bonding pattern within the molecule.
Chi1n : This feature is the normalized version of the Chi1 index. It allows for standardized comparisons of the 1st order molecular connectivity across different molecules.
Chi1v : The Chi1v feature represents the 1st order molecular variance connectivity index. It captures the variance or diversity in the connectivity of atoms within a molecule.
Chi2n : The Chi2n feature indicates the 2nd order molecular connectivity index, which provides information about the extended connectivity of atoms in a molecule. It considers the neighboring atoms of each atom in the molecule.
Chi2v : Similar to Chi2n, the Chi2v feature measures the variance or diversity in the extended connectivity of atoms within a molecule at the 2nd order level.
Chi3v : The Chi3v feature represents the 3rd order molecular variance connectivity index. It captures the variance in the 3rd order connectivity patterns among atoms in a molecule.
Chi4n : This feature corresponds to the 4th order molecular connectivity index, which provides information about the extended connectivity of atoms in a molecule. The Chi4n index is normalized to allow for consistent comparisons across molecules.
EState_VSA1 : EState_VSA1 is a feature that relates to the electrotopological state of a molecule. Specifically, it represents the Van der Waals surface area contribution for a specific atom type, contributing to the overall electrotopological state.
EState_VSA2 : Similar to EState_VSA1, EState_VSA2 also represents the electrotopological state but for a different specific atom type.
ExactMolWt : This feature denotes the exact molecular weight of a molecule. It provides an accurate measurement of the mass of the molecule.
FpDensityMorgan1 : FpDensityMorgan1 represents the Morgan fingerprint density for a specific radius of 1. Morgan fingerprints are a method for generating molecular fingerprints, and this feature captures the density of those fingerprints.
FpDensityMorgan2 : Similar to FpDensityMorgan1, this feature represents the Morgan fingerprint density for a specific radius of 2.
FpDensityMorgan3 : FpDensityMorgan3 corresponds to the Morgan fingerprint density for a specific radius of 3.
HallKierAlpha : The HallKierAlpha feature denotes the Hall-Kier alpha value for a molecule. It is a measure of molecular shape and can provide insights into the overall structure of the molecule.
HeavyAtomMolWt : This feature represents the molecular weight of heavy atoms only, excluding hydrogen atoms. It focuses on the mass of non-hydrogen atoms within the molecule.
Kappa3 : The Kappa3 feature corresponds to the Hall-Kier Kappa3 value, which is a molecular shape descriptor. It provides information about the shape and spatial arrangement of atoms within the molecule.
MaxAbsEStateIndex : This feature represents the maximum absolute value of the E-state index. The E-state index relates to the electronic properties of a molecule, and its maximum absolute value can indicate the presence of specific electronic characteristics.
MinEStateIndex : MinEStateIndex denotes the minimum value of the E-state index. It provides information about the lowest observed electronic property value within the molecule.
NumHeteroatoms : This feature indicates the number of heteroatoms present in a molecule. Heteroatoms are atoms other than carbon and hydrogen, such as oxygen, nitrogen, sulfur, etc. This feature provides insights into the diversity and composition of atoms within the molecule.
PEOE_VSA10 : PEOE_VSA10 represents the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type. It captures the surface area contribution of a particular atom type to the overall electrostatic properties.
PEOE_VSA14 : Similar to PEOE_VSA10, PEOE_VSA14 also represents the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type.
PEOE_VSA6 : This feature corresponds to the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type at a different level.
PEOE_VSA7 : Similar to PEOE_VSA6, PEOE_VSA7 represents the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type.
PEOE_VSA8 : PEOE_VSA8 denotes the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type.
SMR_VSA10 : SMR_VSA10 represents the Van der Waals surface area contribution for atoms whose molar refractivity (SMR) falls within a specific range. It captures the contribution of those atoms to the overall surface area.
SMR_VSA5 : Similar to SMR_VSA10, this feature denotes the Van der Waals surface area contribution for atoms in a different molar refractivity range.
SlogP_VSA3 : The SlogP_VSA3 feature represents the LogP-based surface area contribution. It captures the contribution of a specific atom type to the surface area based on its logarithmic partition coefficient.
VSA_EState9 : This feature denotes the E-state fragment contribution for the Van der Waals surface area calculation. It captures the fragment-specific contribution to the electrostatic properties of the molecule.
fr_COO : The fr_COO feature represents the number of carboxyl (COO) functional groups present in the molecule. It ranges from 0 to 8, providing insights into the presence and abundance of carboxyl groups.
fr_COO2 : Similar to fr_COO, fr_COO2 represents the number of carboxyl (COO) functional groups, ranging from 0 to 8.
EC1 : EC1 is a binary feature representing a predicted label related to Oxidoreductases. It serves as one of the target variables for prediction.
EC2 : EC2 is another binary feature representing a predicted label related to Transferases. It serves as another target variable for prediction.
First, let us remove the variables that will not be required for our analysis: the id column and the EC3 to EC6 labels, since EC1 and EC2 are our only prediction targets.
df_train <- df_train %>% select(-c("id","EC3","EC4","EC5","EC6"))
Next, let us check for the presence of missing values in the dataset.
gg_miss_var(df_train)
Figure 3.1: Missingness in the dataset
Based on the figure 3.1, we can observe that
✅ The dataset does not contain any missing values. This indicates that we have a clean dataset which is ready for EDA and further analysis.
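As a quick numeric cross-check of the same fact, using base R rather than naniar:

```r
# Count missing cells per column and overall; a complete dataset
# should return 0 everywhere.
colSums(is.na(df_train))
sum(is.na(df_train))
```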
We can observe that there are a total of 32 variables in the current dataset! This is far more than we would ideally like to analyse. Such datasets call for a special kind of analysis, called High Dimensional Data Analysis, which relies mainly on techniques such as clustering and principal component analysis to reduce the number of variables without losing too much information. While that would be the rigorous way to go about it, this notebook will instead study the correlation of each variable and reduce the number of variables that show high multicollinearity.
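For completeness, a principal component analysis route could be sketched as follows. This is only an illustrative aside and is not used in the remainder of the notebook; the 90% variance threshold is an arbitrary assumption.

```r
# Illustrative PCA sketch (not used further in this notebook).
# Exclude the target columns and standardise before computing components.
pca_input <- df_train %>% select(-EC1, -EC2)
pca_fit <- prcomp(pca_input, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- summary(pca_fit)$importance["Proportion of Variance", ]

# Keep enough components to explain ~90% of the variance (arbitrary cutoff)
n_components <- which(cumsum(var_explained) >= 0.90)[1]
df_train_pca <- as.data.frame(pca_fit$x[, 1:n_components])
```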
Let us understand how each of these variables correlate.
corrplot(cor(df_train), # Correlation matrix
method = "number", # Correlation plot method
type = "full", # Correlation plot style (also "upper" and "lower")
diag = TRUE, # If TRUE (default), adds the diagonal
tl.col = "black", # Labels color
bg = "white", # Background color
title = "Correlation plot", # Main title
col = NULL,
number.cex = 0.4,
tl.cex = .5)
Figure 4.1: Correlation plot
As we can observe from figure 4.1,
None of the variables have an unusually high correlation with EC1 or EC2. However, we do observe multiple variables with high correlation to each other. This phenomenon is called multicollinearity. Let us set a correlation threshold of 75%: any variables with correlation values higher than this will be dropped from the dataset.
df_corr = cor(df_train)
hc = findCorrelation(df_corr, cutoff=0.75) # Removing variables with greater than 75% correlation
hc = sort(hc)
df_train_new = df_train[,-c(hc)]
Now that we have removed the variables that showed multicollinearity, let us see what the revised dataset looks like.
head(df_train_new) %>%
DT::datatable(width = 500, height = 500, options = list(pageLength = 6))
Let us now create the correlation plot of the revised dataset.
corrplot(cor(df_train_new), # Correlation matrix
method = "number", # Correlation plot method
type = "full", # Correlation plot style (also "upper" and "lower")
diag = TRUE, # If TRUE (default), adds the diagonal
tl.col = "black", # Labels color
bg = "white", # Background color
title = " ", # Main title
col = NULL,
number.cex = 0.6,
tl.cex = .5)
Figure 4.2: Correlation plot of revised dataset
Figure 4.2 depicts the correlation values of the remaining variables, none of which demonstrate multicollinearity above our threshold.
Now that we have figured out the variables of interest, we will perform a univariate analysis of the revised dataset. One of the best ways to study the overall distribution of the variables is through a faceted histogram. Let us dive deeper.
pl1 <- ggplot(data = gather(df_train_new), aes(x = value,fill = factor(key))) + geom_histogram() + facet_wrap(~key,scales ="free_x") + theme_classic() + ggtitle("Univariate analysis of variables") + theme(legend.position = "none",plot.title = element_text(hjust = 0.5)) + labs(y="Number of instances",x = "Value of variable")
pl1
Figure 4.3: Univariate analysis of variables
Based on our analysis of figure 4.3, we can observe that
while most variables range over a large scale on the X-axis, certain variables, namely FpDensityMorgan1 and Kappa3, range over a very small scale. This indicates a large difference in scale among the variables. Hence, the dataset could benefit from a standardisation step at a later point in the analysis.
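As a quick sketch of what that standardisation would look like (the full standardisation is performed later in this notebook), `scale()` centres each variable to mean 0 and rescales it to unit standard deviation:

```r
# Standardisation sketch: centre to mean 0, scale to unit standard deviation.
# Target columns are excluded so only the predictors are rescaled.
predictors_scaled <- df_train_new %>%
  select(-EC1, -EC2) %>%
  scale(center = TRUE, scale = TRUE) %>%
  as.data.frame()

# After scaling, each column should have mean ~0 and sd ~1
round(colMeans(predictors_scaled), 3)
apply(predictors_scaled, 2, sd)
```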
Now that we have performed a univariate analysis, it is time to perform a multivariate analysis to understand our variables better.
Let us observe how these variables differ for each value of EC.
pl2 <- ggplot(data = df_train_new %>% filter(FpDensityMorgan1 > -5), aes(x =FpDensityMorgan1, y=Kappa3)) + geom_point(aes(color = factor(EC1))) + theme_classic() + labs(color = "EC1 indicator")
pl2
Figure 4.4: Kappa3 Vs FpDensityMorgan1 for EC1
pl2 <- ggplot(data = df_train_new %>% filter(FpDensityMorgan1 > -5), aes(x =FpDensityMorgan1, y=Kappa3)) + geom_point(aes(color = factor(EC2))) + theme_classic() + labs(color = "EC2 indicator")
pl2
Figure 4.5: Kappa3 Vs FpDensityMorgan1 for EC2
Based on the analysis from figures 4.4 and 4.5, we can observe that
💡Most values of Kappa3 lie around 0 for both EC1 and EC2. There are very few datapoints which are higher than 0. Hence, the majority of the datapoints can be clustered into one group.💡
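To make the single-cluster observation concrete, a k-means sketch on these two variables could look like the following. The choice of k = 2 is an illustrative assumption, meant to separate the dense group near zero from the outliers.

```r
# Illustrative k-means sketch on the two plotted variables.
# k = 2 is an assumption: one dense cluster near zero, one for the outliers.
set.seed(101)
cluster_input <- df_train_new %>%
  filter(FpDensityMorgan1 > -5) %>%
  select(FpDensityMorgan1, Kappa3) %>%
  scale()  # standardise so neither variable dominates the distance

km <- kmeans(cluster_input, centers = 2, nstart = 25)

# Cluster sizes: we expect one large cluster holding most datapoints
table(km$cluster)
```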
EState_VSA2 and PEOE_VSA8 were observed to have a fair amount of correlation with each other. Let us try to visualise these parameters.
pl3 <- ggplot(data = df_train_new, aes(x = EState_VSA2, y= PEOE_VSA8,color = factor(EC1))) + geom_point() + theme_classic() + labs(color = "EC1 indicator")
pl3
Figure 4.6: EState_VSA2 Vs PEOE_VSA8 for EC1
pl4 <- ggplot(data = df_train_new, aes(x = EState_VSA2, y= PEOE_VSA8,color = factor(EC2))) + geom_point() + theme_classic() + labs(color = "EC2 indicator")
pl4
Figure 4.7: EState_VSA2 Vs PEOE_VSA8 for EC2
Based on figures 4.6 and 4.7, we can observe that
💡there is no strong relationship for the two variables in each of the indicators.💡
While we have reduced the number of variables, we still have a sizable set to analyse. Let us instead plot the variable importances and choose the top 5 variables to study.
To study feature importances, let us use the XGBoost algorithm.
set.seed(101)
df_train_new_EC1 <- df_train_new %>% select(-EC2)
df_train_new_EC2 <- df_train_new %>% select(-EC1)
sample_EC1=sample.split(df_train_new_EC1$EC1,SplitRatio=0.7)
train_EC1=subset(df_train_new_EC1,sample_EC1==T)
test_EC1=subset(df_train_new_EC1,sample_EC1==F)
sample_EC2=sample.split(df_train_new_EC2$EC2,SplitRatio=0.7)
train_EC2=subset(df_train_new_EC2,sample_EC2==T)
test_EC2=subset(df_train_new_EC2,sample_EC2==F)
🥳 Through the above code-chunk, we have successfully created separate train and test datasets for both the EC1 and EC2 indicators. The dataset is now ready for further processing.
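Since `sample.split` stratifies on the label, a quick sanity check on the class proportions (shown here for EC1; the same check applies to EC2) should give near-identical distributions across the two splits:

```r
# Class proportions should be (almost) identical in train and test,
# because sample.split stratifies the 70/30 split on the label.
prop.table(table(train_EC1$EC1))
prop.table(table(test_EC1$EC1))
```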
xgb_model_EC1 <- xgboost(data = as.matrix(train_EC1 %>% select(-EC1)), label = as.matrix(train_EC1$EC1),
max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1] train-logloss:0.604235
## [2] train-logloss:0.592298
xgb_model_EC2 <- xgboost(data = as.matrix(train_EC2 %>% select(-EC2)), label = as.matrix(train_EC2$EC2),
max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1] train-logloss:0.499218
## [2] train-logloss:0.494251
xgb_EC1_importance <- xgb.importance(colnames(train_EC1 %>% select(-EC1)), model = xgb_model_EC1)
xgb_EC2_importance <- xgb.importance(colnames(train_EC2 %>% select(-EC2)), model = xgb_model_EC2)
pl5 <- ggplot(data =xgb_EC1_importance,aes(x = reorder(Feature,-Gain),y = round(Gain,2),fill=Feature)) + geom_col(color='black') + geom_label(aes(label = round(Gain,2))) + theme_classic() + ggtitle("Top 5 feature importances for EC1 \n using XGBoost") + labs(x = "Feature name",y = "Feature importance") + theme(plot.title = element_text(hjust = 0.5),legend.position = 'none',axis.title = element_text(face = 'bold'))
pl5
Figure 4.8: Feature importances for EC1 using XGBoost
pl6 <- ggplot(data =xgb_EC2_importance,aes(x = reorder(Feature,-Gain),y = round(Gain,2),fill=Feature)) + geom_col(color='black') + geom_label(aes(label = round(Gain,2))) + theme_classic() + ggtitle("Top 5 feature importances for EC2 \n using XGBoost") + labs(x = "Feature name",y = "Feature importance") + theme(plot.title = element_text(hjust = 0.5),legend.position = 'none',axis.title = element_text(face = 'bold')) + scale_fill_brewer(palette = 'Accent')
pl6
Figure 4.9: Feature importances for EC2 using XGBoost
💡 we have obtained the top 5 features using the XGBoost model for each of the EC1 and EC2 indicators. This will help us concentrate our EDA efforts only on the most important features.💡
df_train_new_EC1_scaled <- as.data.frame(df_train_new_EC1 %>% filter(EC1 == 1) %>% scale(center=TRUE,scale=TRUE))
EC1_imp <- as.data.frame(df_train_new_EC1_scaled %>% select(-EC1) %>% summarise_all(median,na.rm=TRUE))
EC1_imp <- rbind(rep(1, ncol(df_train_new_EC1_scaled) - 1), rep(-1, ncol(df_train_new_EC1_scaled) - 1), EC1_imp) # radarchart expects the first two rows to be the axis max and min
df_train_new_EC2_scaled <- as.data.frame(df_train_new_EC2 %>% filter(EC2 == 1) %>% scale(center=TRUE,scale=TRUE))
EC2_imp <- as.data.frame(df_train_new_EC2_scaled %>% select(-EC2) %>% summarise_all(median,na.rm=TRUE))
EC2_imp <- rbind(rep(1, ncol(df_train_new_EC2_scaled) - 1), rep(-1, ncol(df_train_new_EC2_scaled) - 1), EC2_imp)
EC_imp <- rbind(EC1_imp, EC2_imp[3, ])
rownames(EC_imp) <- 1:nrow(EC_imp)
areas <- c(rgb(1, 0, 0, 0.3),
rgb(0, 1, 0, 0.3))
radarchart(EC_imp,
axistype = 1,
cglcol="gray", cglty=1, axislabcol="gray", caxislabels=seq(-1,1,0.5), cglwd=0.8,
pcol = 2:4, # Color for each line
plwd = 4, # Width for each line
plty = 1, # Line type for each line
pfcol = areas, # Color of the areas
vlcex=0.5, # Size of label
title = "Median standardised values for EC1 and EC2 indicators"
)
legend("topright",
legend = paste(c("EC1","EC2")),
bty = "n", pch = 20, col = areas,
text.col = "grey25", pt.cex = 2)
Figure 4.10: Median standardised values for EC1 and EC2 indicators
After analysing figure 4.10, we observe that
💡 The median standardised values for both EC1 and EC2 indicators are very similar. The major difference however can be observed for the variable PEOE_VSA7. EC2 is observed to demonstrate a relatively lower scaled value when compared to EC1.💡
As noted in our univariate analysis, the dataset may benefit from a standardisation process. Let us standardise the variables in the same way as was done for figure 4.10.
df_train_new_EC1_scaled <- as.data.frame(df_train_new_EC1[, 1:(ncol(df_train_new_EC1) - 1)] %>% scale(center = TRUE, scale = TRUE))
df_train_new_EC1_scaled <- cbind(df_train_new_EC1_scaled,EC1 = df_train_new_EC1$EC1)
df_train_new_EC2_scaled <- as.data.frame(df_train_new_EC2[, 1:(ncol(df_train_new_EC2) - 1)] %>% scale(center = TRUE, scale = TRUE))
df_train_new_EC2_scaled <- cbind(df_train_new_EC2_scaled,EC2 = df_train_new_EC2$EC2)
set.seed(101)
sample_EC1=sample.split(df_train_new_EC1_scaled$EC1,SplitRatio=0.7)
train_EC1=subset(df_train_new_EC1_scaled,sample_EC1==T)
test_EC1=subset(df_train_new_EC1_scaled,sample_EC1==F)
sample_EC2=sample.split(df_train_new_EC2_scaled$EC2,SplitRatio=0.7)
train_EC2=subset(df_train_new_EC2_scaled,sample_EC2==T)
test_EC2=subset(df_train_new_EC2_scaled,sample_EC2==F)
We will first apply the logistic regression algorithm to perform classification for EC1 and EC2 indicators.
model_logit_EC1 <- glm(EC1~.,family=binomial(link='logit'),data=train_EC1)
pR2(model_logit_EC1)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -6.179552e+03 -6.603589e+03 8.480734e+02 6.421306e-02 7.840331e-02
## r2CU
## 1.089552e-01
model_logit_EC2 <- glm(EC2~.,family=binomial(link='logit'),data=train_EC2)
pR2(model_logit_EC2)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -5.165338e+03 -5.212117e+03 9.355681e+01 8.974935e-03 8.967523e-03
## r2CU
## 1.415614e-02
fitted.resultsEC1 <- predict(model_logit_EC1,newdata=subset(test_EC1,select=-(EC1)),type='response')
fitted.resultsEC1 <- ifelse(fitted.resultsEC1 > 0.5,1,0)
misClasificError <- mean(fitted.resultsEC1 != test_EC1$EC1)
print(paste('Accuracy of logistic regression:',1-misClasificError))
## [1] "Accuracy of logistic regression: 0.690406650190968"
fitted.resultsEC2 <- predict(model_logit_EC2,newdata=subset(test_EC2,select=-(EC2)),type='response')
fitted.resultsEC2 <- ifelse(fitted.resultsEC2 > 0.5,1,0)
misClasificError <- mean(fitted.resultsEC2 != test_EC2$EC2)
print(paste('Accuracy of logistic regression:',1-misClasificError))
## [1] "Accuracy of logistic regression: 0.79874213836478"
draw_confusion_matrix <- function(cm) {
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
rect(150, 430, 240, 370, col='#3F97D0')
text(195, 435, 'False', cex=1.2)
rect(250, 430, 340, 370, col='#F7AD50')
text(295, 435, 'True', cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col='#F7AD50')
rect(250, 305, 340, 365, col='#3F97D0')
text(140, 400, 'False', cex=1.2, srt=90)
text(140, 335, 'True', cex=1.2, srt=90)
# add in the cm results
res <- as.numeric(cm$table)
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
As we can observe,
💡 the logistic regression model correctly predicted approximately 69% of the EC1 cases and 80% of the EC2 cases. 💡
Let us further study the performance of the logistic regression model through the Receiver Operating Characteristic (ROC) curve.
p <- as.numeric(predict(model_logit_EC1, newdata=subset(test_EC1,select=-c(EC1)), type="response"))
q <- as.numeric(predict(model_logit_EC2, newdata=subset(test_EC2,select=-c(EC2)), type="response"))
pr <- prediction(p, test_EC1$EC1)
po <- prediction(q,test_EC2$EC2)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
prf2 <- performance(po, measure = "tpr", x.measure = "fpr")
plot( prf, col = 'blue')
plot(prf2, add = TRUE, col = 'red')
legend("right", c("EC1", "EC2"), lty=1,
col = c("blue", "red"), bty="n", inset=c(0,-0.15))
title("Receiver Operating Curve for Logistic Regression")
Figure 5.1: Receiver Operating Curve for Logistic Regression
Based on the analysis of the ROC plot in figure 5.1, we observe that,
💡 the logistic model predicts the cases for EC1 better than EC2, as its curve covers a larger area under the true positive rate. 💡
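We can quantify this visual comparison with the area under the curve (AUC), computed from the same ROCR prediction objects created above:

```r
# Area under the ROC curve for each indicator, via ROCR.
# A larger AUC confirms the visual impression from the ROC plot.
auc_EC1 <- performance(pr, measure = "auc")@y.values[[1]]
auc_EC2 <- performance(po, measure = "auc")@y.values[[1]]
print(paste("AUC for EC1:", round(auc_EC1, 3)))
print(paste("AUC for EC2:", round(auc_EC2, 3)))
```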
cm_logit_EC1 <- confusionMatrix(factor(fitted.resultsEC1),factor(test_EC1$EC1))
draw_confusion_matrix(cm_logit_EC1)
Figure 5.2: Confusion matrix for Logistic Regression EC1
cm_logit_EC2 <- confusionMatrix(factor(fitted.resultsEC2),factor(test_EC2$EC2))
draw_confusion_matrix(cm_logit_EC2)
Figure 5.3: Confusion matrix for Logistic Regression EC2
Let us now try an extreme gradient boosting ensemble method, commonly termed the XGBoost classifier.
xgb_model_EC1 <- xgboost(data = as.matrix(train_EC1 %>% select(-EC1)), label = as.matrix(train_EC1$EC1),
max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1] train-logloss:0.604235
## [2] train-logloss:0.592298
xgb_model_EC2 <- xgboost(data = as.matrix(train_EC2 %>% select(-EC2)), label = as.matrix(train_EC2$EC2),
max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1] train-logloss:0.499218
## [2] train-logloss:0.494251
pred_xgb_EC1 <- predict(xgb_model_EC1, as.matrix(test_EC1 %>% select(-c(EC1))))
pred_xgb_EC1 <- if_else(pred_xgb_EC1 > 0.5,1,0)
pred_xgb_EC2 <- predict(xgb_model_EC2, as.matrix(test_EC2 %>% select(-c(EC2))))
pred_xgb_EC2 <- if_else(pred_xgb_EC2 > 0.5,1,0)
cm_xgb_EC1 <- confusionMatrix(factor(pred_xgb_EC1),factor(test_EC1$EC1))
draw_confusion_matrix(cm_xgb_EC1)
Figure 5.4: Confusion matrix of the XGBoost model for EC1
cm_xgb_EC2 <- confusionMatrix(factor(pred_xgb_EC2),factor(test_EC2$EC2))
draw_confusion_matrix(cm_xgb_EC2)
Figure 5.5: Confusion matrix of the XGBoost model for EC2